graph TD
X[sleep-problems] --> Y_t[anxiety_t]
2024-09-26
McNeish et al. (2016) provides a nice summary of the differences
Parameter interpretation
Added comments:
To hone in on the understanding of the estimation approaches, we want to keep stable across simulations:
And we want to vary:
Possible research question
Let’s draw a DAG where we are estimating the effect of \(X\) (sleep-problems) on \(Y_t\) (anxiety)
Let’s specify the structural causal model (SCM) for the drawn DAG
\[X_i := \mathcal{N}(5, 1) \quad \text{for} \quad i = 1, \ldots, n_{\text{individuals}}\]
\[Y_{it} := 2 + 0.6 \cdot X_i + \epsilon_{it} \quad \text{for} \quad i = 1, \ldots, n_{\text{individuals}} \quad \text{and} \quad t = 1, \ldots, n_{\text{measurements}}\] where \(\epsilon_{i,t}\) follows a multivariate normal distribution with mean zero and an exchangeable correlation structure across time points \(t\):
\[\epsilon_{i,1}, \ldots, \epsilon_{i,n_{\text{measurements}}} \sim \mathcal{N}(0, \Sigma)\]
The covariance matrix \(\Sigma\) has the form:
\[\Sigma = \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix}\]
with \(\rho = 0.7\), representing the exchangeable correlation structure between the errors at different time points within an individual.
# Parameters
set.seed(123) # For reproducibility
n_individuals <- 100 # Number of individuals
n_measurements <- 5 # Number of measurements per individual
rho <- 0.7 # Correlation between measurements within an individual
# Generate individual-level covariates (X)
X <- rnorm(n_individuals, mean = 5, sd = 1)
# Correlation matrix (exchangeable structure)
corr_matrix <- matrix(rho, n_measurements, n_measurements)
diag(corr_matrix) <- 1 # Diagonal is 1
# Simulate repeated measures for each individual
ex1.data <- data.frame()
for (i in 1:n_individuals) {
# Generate errors with the working correlation structure
epsilon_Y <- mvrnorm(1, mu = rep(0, n_measurements), Sigma = corr_matrix)
# Generate Y for this individual across measurement occasions
Y <- 2 + 0.6 * X[i] + epsilon_Y
# Store the data
individual_data <- data.frame(
id = i,
occasion = 1:n_measurements,
X = X[i],
Y = Y
)
ex1.data <- rbind(ex1.data, individual_data)
}
# Make sure first timepoint is zero (for intercept interpretation)
ex1.data$occasion <- ex1.data$occasion - 1
# Check the first few rows of the data
head(ex1.data, 10) id occasion X Y
1 1 0 4.439524 5.320932
2 1 1 4.439524 4.794407
3 1 2 4.439524 5.276651
4 1 3 4.439524 5.486515
5 1 4 4.439524 5.536658
6 2 0 4.769823 4.545282
7 2 1 4.769823 5.312338
8 2 2 4.769823 4.481625
9 2 3 4.769823 5.706999
10 2 4 4.769823 4.459495
The MLM and GEE model equations share (1) outcome \(Y_{it}\), (2) predictor \(X_i\) and (3) time-varying predictor \(T_{it}\).
MLM model equations (including random effects)
where:
GEE model equations
where:
Let’s explore the observed correlation matrix first
data_wide <- ex1.data %>%
pivot_wider(names_from = occasion, values_from = Y, names_prefix = "Y_")
# Remove the ID column
data_wide_Y <- data_wide %>% dplyr::select(starts_with("Y_"))
# Compute the observed correlation matrix for the measurements across all individuals
observed_corr_matrix <- cor(data_wide_Y, use = "pairwise.complete.obs")
# Display the observed correlation matrix
knitr::kable(observed_corr_matrix, digits = 3)| Y_0 | Y_1 | Y_2 | Y_3 | Y_4 | |
|---|---|---|---|---|---|
| Y_0 | 1.000 | 0.775 | 0.790 | 0.814 | 0.755 |
| Y_1 | 0.775 | 1.000 | 0.810 | 0.822 | 0.791 |
| Y_2 | 0.790 | 0.810 | 1.000 | 0.822 | 0.845 |
| Y_3 | 0.814 | 0.822 | 0.822 | 1.000 | 0.813 |
| Y_4 | 0.755 | 0.791 | 0.845 | 0.813 | 1.000 |
Aligning with the simulation, the matrix seems to be more or less exchangeable.
\(\to\) we specify the working correlation matrix \(R\) to be exchangeable
We first estimate the regression coefficients as if the data were independent (ignoring the correlation between measurements within individuals). This step provides initial estimates of \(\beta\).
# Fit a generalized linear model (GLM) assuming independence
# Example using a linear model (gaussian family with identity link)
glm_fit <- glm(Y ~ X + occasion, data = ex1.data, family = gaussian(link = "identity"))
# Display the summary of the GLM fit
coef(glm_fit) (Intercept) X occasion
2.0969471228 0.5647743308 0.0002371629
The gee function also displays the information for the initial regression estimate
# Checking if this aligns with GEE function
ex1.GEE <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "exchangeable") (Intercept) X occasion
2.0969471228 0.5647743308 0.0002371629
This aligns with the fixed effects estimates from the glm function.
We compute the residuals \(e_{it}\) based on the regression model we just fit:
Next, we estimate the initial model assuming independence to get preliminary parameter estimates. Then, we compute the residuals \(e_{it}\) as:
With the working correlation structure \(R\), we update the working voriance-covariance matrix \(V_i\) for individual \(i\):
\[V(a) = \phi A_i^{1/2} R_i(a) A_i^{1/2}\]
where \(A_i\) is the diagonal matrix of variances for individual \(i\) and \(\phi\) is a scale parameter.
in the case of normally distributed outcome with homogeneous variance across time:
\[V(a) = \phi R_i(a) \]
After iterating between step 3 and step 5 until convergence is reached, we apply the sandwich estimator to obtain robust standard errors.
# The GEE model summary provides robust standard errors using the sandwich estimator
robust_se <- summary(ex1.GEE)$coefficients
# Display the robust standard errors
knitr::kable(robust_se, digits = 3)| Estimate | Naive S.E. | Naive z | Robust S.E. | Robust z | |
|---|---|---|---|---|---|
| (Intercept) | 2.097 | 0.537 | 3.902 | 0.603 | 3.476 |
| X | 0.565 | 0.104 | 5.445 | 0.111 | 5.106 |
| occasion | 0.000 | 0.017 | 0.014 | 0.018 | 0.013 |
there is a large difference between the naive and robust standard errors
Let’s estimate the effect of \(X\) on \(Y_t\) using MLM (often done with model building)
# Model 1a: Intercept only model, random intercept
ex1.MLM1a <- lmer(Y ~ 1 + (1 | id), data = ex1.data, REML = FALSE)
# Model 1b: Intercept only model including the predictor time
ex1.MLM1b <- lmer(Y ~ 1 + occasion + (1 | id), data = ex1.data, REML = FALSE)
# anova(ex1.MLM1a, ex1.MLM1b) # Model 1a is not significantly worse than Model 1b
# Model 2: including remainder of the level 1 predictors
ex1.MLM2 <- lmer(Y ~ 1 + occasion + X + (1 | id), data = ex1.data, REML = FALSE)
# anova(ex1.MLM1b, ex1.MLM2) # Model 1b is significantly worse than Model 2
fixef(ex1.MLM2) (Intercept) occasion X
2.0969471228 0.0002371629 0.5647743308
However, we are only really interested in the last model (with random slope and intercept)
GEE Models
ex1.GEE_exch <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "exchangeable") (Intercept) X occasion
2.0969471228 0.5647743308 0.0002371629
ex1.GEE_ind <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "independence") (Intercept) X occasion
2.0969471228 0.5647743308 0.0002371629
(Intercept) X occasion
2.0969471228 0.5647743308 0.0002371629
ex1.GEE_uns <- gee(Y ~ X + occasion, data = ex1.data, id = id, family = gaussian(), corstr = "unstructured") (Intercept) X occasion
2.0969471228 0.5647743308 0.0002371629
Let’s compare the fixed effects
# Extracting the coefficients
coefs_MLM_reml <- fixef(ex1.MLM3.1_reml)
coefs_MLM_mle <- fixef(ex1.MLM3.1_mle)
# reorder coefficients X and occasion
coefs_MLM_reml <- coefs_MLM_reml[c(1, 3, 2)]
coefs_MLM_mle <- coefs_MLM_mle[c(1, 3, 2)]
# continue with GEE estimates
coef_GEE_exch <- coef(ex1.GEE_exch)
coef_GEE_ind <- coef(ex1.GEE_ind)
coef_GEE_ar <- coef(ex1.GEE_ar)
coef_GEE_uns <- coef(ex1.GEE_uns)
# Creating a data frame with the coefficients
coefs <- data.frame(
row.names = c("Intercept", "X", "occasion"),
MLM_REML = coefs_MLM_reml,
MLM_MLE = coefs_MLM_mle,
GEE_exch = coef_GEE_exch,
GEE_ind = coef_GEE_ind,
GEE_ar = coef_GEE_ar,
GEE_uns = coef_GEE_uns
)
# Create nice table
knitr::kable(coefs, digits = 3)| MLM_REML | MLM_MLE | GEE_exch | GEE_ind | GEE_ar | GEE_uns | |
|---|---|---|---|---|---|---|
| Intercept | 2.092 | 2.092 | 2.097 | 2.097 | 2.178 | 2.069 |
| X | 0.566 | 0.566 | 0.565 | 0.565 | 0.555 | 0.570 |
| occasion | 0.000 | 0.000 | 0.000 | 0.000 | -0.006 | 0.003 |
The structural causal model was specified as follows
\[Y_{it} := 2 + 0.6 \cdot X_i + \epsilon_{it}\]
Observations